Proposal: Cluster Scoped Resources #1400

Closed

Conversation

nikhildl12

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 14, 2017
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 14, 2017
@k8s-github-robot k8s-github-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Nov 14, 2017
@@ -0,0 +1,76 @@
# Cluster Scoped Resources
Member

please follow the KEP process outlined by @kubernetes/sig-architecture-feature-requests

Contributor

Is KEP now a requirement or a recommendation? That was not clear from the contributor summit discussions.

Member

/cc @jdumars

Author

Cluster scoped resources are consumable resources that do not belong to any specific node but instead are available across multiple nodes in a cluster. These resources are accounted for like other consumable resources and should be usable by the scheduler when deciding whether a pod can actually be scheduled.


## Motivation
Member

Software licenses are the most common reason for such features in other systems.

@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. kind/feature Categorizes issue or PR as related to a new feature. labels Nov 22, 2017


## Motivation
Resources in Kubernetes such as CPU and memory are available at the node level and can be consumed by pods by requesting them. However, there are some resources that do not belong to a specific node but are consumable across all or a group of nodes in the cluster. As an example, IP addresses in a pool can be shared across pods running on multiple nodes within a network scope. Another use case is locally attached shared storage in a rack, which is consumable across several nodes. Hence there is a need to represent such a resource at the cluster level, consumable across all or a group of nodes in the cluster.


This is more like node-group scoped resources in these examples.

Contributor

Please add a list of 5-8 example resources that would be tracked like this. I’d like more validation and concrete discussion on each type to guide design.

Contributor

+1. There are many use cases for extending resource APIs and I'd like to first get a collection of use-cases before identifying possible solutions.

Author

Added a few use cases.

@davidopp
Member

cc/ @kubernetes/sig-scheduling-feature-requests
cc/ @vishh

@davidopp
Member

Thanks for writing this, it's definitely a feature we have been talking about for a while.

I think a complete solution to this problem should consider how the resource allocator for the cluster-level resource fits in. I think that cluster-scoped resources are likely to have some kind of external allocator, for example the agent that hands out IP addresses or software licenses. It's important for the scheduler's view of free resources to stay in sync with that of the external allocator, which has the authoritative information, so that we can minimize the likelihood that a container starts up and finds that the resource is not actually available.

For example, with a normal resource the scheduler assumes the resources become freed when the pod terminates or is deleted. But with cluster-level resources, if we leave the allocation and deallocation of the resource up to the container, it might be possible to leak resources (container forgets to release the IP address or license, or gets killed before it can, so the resource is still allocated but the scheduler thinks it is free because the pod has terminated). So maybe the scheduler should be responsible for reserving the resource from the allocator before binding the pod, and unreserving the resource via the allocator when the pod terminates. It's probably quite complicated to ensure that a container only tries to allocate resources that have been reserved for it, so it's probably not a "secure" solution but might be good enough.

One approach is what we did for PDB and ResourceQuota, where decrementing the amount free is synchronous with requesting it (in the cluster-scoped resource case, this would mean the scheduler decrements the free) but replenishing the resource when it is no longer in use is asynchronous and done by a separate controller (could be the agent that is responsible for the cluster-level resource, when a container deallocates the resource).
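To make that pattern concrete, here is a minimal Go sketch (all names are hypothetical; nothing here exists in Kubernetes): the scheduler-side path decrements the free quantity synchronously before binding, while a separate controller replenishes it asynchronously once the external allocator confirms the release.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// clusterResourcePool is a hypothetical in-memory view of a cluster-scoped
// resource's free quantity, shared between the scheduler and a replenishing
// controller.
type clusterResourcePool struct {
	mu   sync.Mutex
	free int64
}

// Reserve decrements the free quantity synchronously, before the pod is bound.
// It fails if not enough of the resource is currently free.
func (p *clusterResourcePool) Reserve(qty int64) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.free < qty {
		return errors.New("insufficient cluster-scoped resource")
	}
	p.free -= qty
	return nil
}

// Replenish is called asynchronously by a separate controller once the
// external allocator confirms the resource has actually been released.
func (p *clusterResourcePool) Replenish(qty int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.free += qty
}

func main() {
	pool := &clusterResourcePool{free: 5}
	if err := pool.Reserve(2); err != nil { // scheduler path: synchronous decrement
		fmt.Println("schedule failed:", err)
		return
	}
	fmt.Println("pod bound, free now:", pool.free)
	pool.Replenish(2) // controller path: asynchronous replenish after release
	fmt.Println("resource released, free now:", pool.free)
}
```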

}

// ClusterResource represents a resource which is available at a cluster level
type ClusterResource struct {
Contributor

If this is a form of quota, it should be named as such: ClusterResourceNodeQuota. It's not actually clear how this API aligns with ResourceQuota; please comment to that effect.

Author

ClusterResource is an API type that represents a cluster scoped resource. However, its integration with ResourceQuota needs to be added, probably at a later phase such as beta.
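For the sake of discussion, here is a rough sketch of what such a type could contain; the fields below are illustrative assumptions, not the fields from this PR's diff.

```go
// Illustrative only; these field names are assumptions, not the fields
// proposed in this PR.
package api

// ClusterResource represents a consumable resource available at the cluster
// (or node-group) level rather than on a single node.
type ClusterResource struct {
	// Name is the resource name, e.g. "example.com/licenses".
	Name string
	// NodeSelector restricts the group of nodes the resource is shared
	// across; empty means the whole cluster.
	NodeSelector map[string]string
	// Capacity is the total quantity of the resource.
	Capacity int64
	// Allocatable is the quantity currently available for scheduling.
	Allocatable int64
}
```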

// pkg/api/types.go:

// ClusterResourceQuantity represents quantity of a ClusterResource
type ClusterResourceQuantity struct {
Contributor

What does the discovery/initialization flow look like?

Author

The cluster admin or other controllers will post ClusterResource objects that capture the capacity and allocatable quantities of a ClusterResource, which will then be used by the scheduler.

}
```
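To illustrate the discovery/initialization flow described in the thread above, here is a small sketch of an admin or controller posting a ClusterResource object that the scheduler would then watch; the client interface is a hypothetical stand-in, not an existing client-go API.

```go
// Hypothetical sketch of the discovery/initialization flow; the client
// interface below is an assumption, not an existing client-go API.
package main

import "fmt"

// ClusterResource mirrors the illustrative type sketched earlier in this
// conversation.
type ClusterResource struct {
	Name        string
	Capacity    int64
	Allocatable int64
}

// clusterResourceClient stands in for whatever API client a cluster admin or
// controller would use to create ClusterResource objects.
type clusterResourceClient interface {
	Create(cr ClusterResource) error
}

// fakeClient just prints what would be posted to the API server.
type fakeClient struct{}

func (fakeClient) Create(cr ClusterResource) error {
	fmt.Printf("posted ClusterResource %q: capacity=%d allocatable=%d\n",
		cr.Name, cr.Capacity, cr.Allocatable)
	return nil
}

func main() {
	var c clusterResourceClient = fakeClient{}
	// An admin or external controller posts the object; the scheduler watches
	// these objects and uses Allocatable when making scheduling decisions.
	_ = c.Create(ClusterResource{
		Name:        "example.com/ip-addresses",
		Capacity:    256,
		Allocatable: 256,
	})
}
```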

`clusterInfo` is added to the scheduler cache to do accounting for ClusterResources consumed by pods. `clusterInfo` will be exposed to the predicate and priority functions in order to take ClusterResources into consideration when making scheduling decisions.
Contributor

How will `clusterInfo` be built?

Author

`clusterInfo` can be built similarly to how we build `nodeInfo`, since the scheduler will be watching ClusterResources.
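As a rough illustration of that idea, here is a sketch of a `clusterInfo`-style cache feeding a predicate-style check; names and signatures are assumptions, not the scheduler's actual code.

```go
// Hypothetical sketch of a clusterInfo-style cache and a predicate check.
package main

import "fmt"

// clusterInfo caches the allocatable and requested quantities of each
// cluster-scoped resource, analogous to nodeInfo for node-level resources.
type clusterInfo struct {
	allocatable map[string]int64 // from watched ClusterResource objects
	requested   map[string]int64 // accumulated from pods already scheduled
}

// fitsClusterResources is a predicate-style check: the pod fits only if every
// cluster-scoped resource it requests still has enough unallocated quantity.
func (c *clusterInfo) fitsClusterResources(podRequests map[string]int64) bool {
	for name, req := range podRequests {
		if c.requested[name]+req > c.allocatable[name] {
			return false
		}
	}
	return true
}

// assumePod updates the cache when the scheduler assumes the pod, mirroring
// how nodeInfo accounting works today.
func (c *clusterInfo) assumePod(podRequests map[string]int64) {
	for name, req := range podRequests {
		c.requested[name] += req
	}
}

func main() {
	ci := &clusterInfo{
		allocatable: map[string]int64{"example.com/licenses": 3},
		requested:   map[string]int64{},
	}
	req := map[string]int64{"example.com/licenses": 2}
	fmt.Println("fits:", ci.fitsClusterResources(req)) // true
	ci.assumePod(req)
	fmt.Println("fits again:", ci.fitsClusterResources(req)) // false: only 1 left
}
```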


ClusterResources are consumable by pods just like CPU and memory, by specifying them in the pod request. The scheduler should take care of the resource accounting for ClusterResources so that no more than the available amount is simultaneously allocated to pods. The prefix used to identify a ClusterResource could be
```
pod.alpha.kubernetes.io/cluster-resource-
Contributor

I'm not a fan of special prefixes. I'd like to see if we can avoid overloading resource names.

Contributor

+1, we just moved away from this pattern with extended resources.

Author

It can follow fully-qualified resource names similar to extended resources, but we need to see how those will be differentiated.
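To keep the naming discussion concrete, here is a small sketch contrasting the two options; the prefix is the alpha prefix from this proposal, and the helper function is purely illustrative.

```go
// Illustrative sketch contrasting the two naming options discussed here.
package main

import (
	"fmt"
	"strings"
)

const clusterResourcePrefix = "pod.alpha.kubernetes.io/cluster-resource-"

// isClusterResource shows how the scheduler could distinguish a cluster-scoped
// resource under the prefix option. With fully-qualified names (as used by
// extended resources), this check would not work, and the distinction would
// have to come from the ClusterResource/ResourceClass object instead.
func isClusterResource(name string) bool {
	return strings.HasPrefix(name, clusterResourcePrefix)
}

func main() {
	// Option 1: prefixed name embedded in the pod's resource requests.
	prefixed := clusterResourcePrefix + "ip-addresses"
	// Option 2: a fully-qualified name, indistinguishable from an extended
	// resource by the name alone.
	qualified := "example.com/ip-addresses"

	fmt.Println(prefixed, "->", isClusterResource(prefixed))   // true
	fmt.Println(qualified, "->", isClusterResource(qualified)) // false
}
```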

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nikhildl12
We suggest the following additional approver: davidopp

Assign the PR to them by writing /assign @davidopp in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@nikhildl12
Author

@davidopp @timothysc @vishh: I would like to understand what the next steps could be for this proposal. As a first action item, I can submit this in the form of a KEP: https://github.com/kubernetes/community/blob/master/keps/0000-kep-template.md

@cblecker
Member

@nikhildl12 One important step is to sort out your CLA, as outlined here: #1400 (comment)

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 29, 2018
pod.alpha.kubernetes.io/cluster-resource-
```

### Accounting in scheduler
@bsalamat Jan 30, 2018
Member

We have extended resources and several types of first-class resources in the scheduler already. I think it would be possible to come up with a single representation that covers all of these types. For example, I don't see much of a difference between a cluster resource and an extended resource from the scheduler's point of view. An extended resource with an additional "type" could represent a cluster resource.

Author

The key difference between these two resources is the "scope".
Currently, extended resources are exposed as part of node status because they are tied to a node, while cluster scoped resources have to be represented outside the scope of a node. But we can surely have a comprehensive API that covers both. From the scheduler's point of view, it will need some additional logic to calculate and cache the available capacity of a cluster scoped resource across a set of nodes.

Member

Yes, the "ResourceClass" that @jiayingz is working on is an effort in that direction to provide a comprehensive API to represent various types of resources, including cluster resources.

Contributor

Yes. As @bsalamat mentioned, we are working on a new Resource API proposal that aims to provide a comprehensive API for both node-level resources and cluster-level resources. Here is the current PR:
#782
It is still WIP and the current plan is to focus on node-level resources during the initial phase. But I think even the initial API should help solve some of the listed problems here. Please take a look and let us know if you see any missing pieces.
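A purely illustrative sketch of a unified representation with an explicit scope, along the lines discussed in this thread; the names below are assumptions and not the ResourceClass API being proposed in #782.

```go
// Purely illustrative; these names are assumptions, not the ResourceClass API.
package api

// ResourceScope distinguishes where a resource is accounted.
type ResourceScope string

const (
	NodeScope    ResourceScope = "Node"    // tied to a single node, like extended resources
	ClusterScope ResourceScope = "Cluster" // shared across a group of nodes
)

// ScopedResource could describe both cases; for ClusterScope the scheduler
// would additionally calculate and cache availability across the selected
// set of nodes.
type ScopedResource struct {
	Name         string
	Scope        ResourceScope
	NodeSelector map[string]string // only meaningful for ClusterScope
	Capacity     int64
	Allocatable  int64
}
```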


### Accounting in scheduler

ClusterResources should be tracked as normal consumable resources and should be considered by the scheduler when determining whether a pod can actually be scheduled.
@bsalamat Jan 30, 2018
Member

Another important aspect of cluster resources which is not covered here is how to bind these resources to a chosen node during/after scheduling. Fairly complex logic has already been added to the scheduler to handle provisioning and binding PVs to nodes during scheduling. Similar processes may be needed for other resources, such as TPUs. I think that aspect should be covered by the proposal.

@vishh @jiayingz

Author

To cover that aspect, I prefer the approach mentioned by @davidopp in his previous comment. The external agent/controller which exposes the available capacity of this resource can be made responsible for binding, or for making sure that those resources are ready to use, when a pod is going to run on a node. Similarly, when a pod dies, that agent needs to deallocate/unbind the corresponding resource and increment the available quantity so that it can be used for scheduling new pods.
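As a rough sketch of that division of responsibility (all types and interfaces below are assumptions), the agent binds the resource before the pod runs, and releases it and replenishes the available quantity when the pod dies.

```go
// Hypothetical sketch of the external agent/controller described above; the
// event and allocator interfaces are assumptions for illustration only.
package main

import "fmt"

// podEvent is a simplified notification the agent reacts to.
type podEvent struct {
	pod      string
	started  bool  // true: pod is about to run; false: pod terminated
	quantity int64 // amount of the cluster-scoped resource the pod requested
}

// allocator is whatever system actually hands out the resource
// (IP pool manager, license server, storage controller, ...).
type allocator interface {
	Bind(pod string, qty int64) error
	Release(pod string, qty int64) error
}

// agent keeps the advertised available quantity in sync with the allocator,
// so the scheduler's view does not drift from the authoritative one.
type agent struct {
	alloc       allocator
	allocatable int64
}

func (a *agent) handle(ev podEvent) error {
	if ev.started {
		// Bind/prepare the resource before the pod runs on its node. The
		// matching decrement of the available quantity happened at scheduling
		// time, so it is not repeated here.
		return a.alloc.Bind(ev.pod, ev.quantity)
	}
	// The pod died: unbind the resource and make the quantity schedulable again.
	if err := a.alloc.Release(ev.pod, ev.quantity); err != nil {
		return err
	}
	a.allocatable += ev.quantity
	// In a real controller this would be written back to the ClusterResource
	// object the scheduler watches.
	fmt.Println("allocatable now:", a.allocatable)
	return nil
}

// noopAllocator is a stand-in used only to make the example runnable.
type noopAllocator struct{}

func (noopAllocator) Bind(pod string, qty int64) error    { return nil }
func (noopAllocator) Release(pod string, qty int64) error { return nil }

func main() {
	// 3 units were already reserved by the scheduler when web-0 was scheduled.
	a := &agent{alloc: noopAllocator{}, allocatable: 7}
	_ = a.handle(podEvent{pod: "web-0", started: true, quantity: 3})
	_ = a.handle(podEvent{pod: "web-0", started: false, quantity: 3})
}
```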

@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 6, 2018
@timothysc timothysc removed their assignment Apr 13, 2018
@timothysc timothysc dismissed their stale review April 13, 2018 21:29

out of date

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krmayankk

Why has this been abandoned? The proposal seems fair.
